Lab Assignment: Reinforcement learning¶
Lab team: teamCode¶
Name (member 1): Héctor Tablero Díaz¶
Name (member 2): Álvaro Martínez Gamo¶
- Please include your full name at the beginning of all submitted files.
- Make sure the presentation is well-structured: the report will be evaluated not only for correctness, but also for clarity, conciseness, and completeness.
- Make use of figures and tables to summarize the results and illustrate the discussions.
- If external material is used, the sources must be cited.
- Include references in APA format (https://pitt.libguides.com/citationhelp/apa7). Missing or poorly formatted references may be penalized.
- A generative AI tool can be used for consultation. You must specify the tool used in your report.
- You are not allowed to use a generative AI tool to generate code.
Submit a single .zip file, whose name has the format AA3_2024_2025_P03_teamCode_lastName1_lastName2.zip
The name must not include graphical accents, spaces, uppercase letters, or special characters.
For example: AA3_2024_2025_P03_V03_munyoz_deLaRosa.zip
This compressed file must include the following files:
- This Python notebook with the solutions of the exercises. The notebook should include only code snippets, figures, tables, derivations, and explanations (with LaTeX if necessary) in Markdown cells. Handwritten material can be included in the Python notebook as images. Functions should be defined in a separate .py file, not in the notebook.
- The necessary .py and additional files to ensure the Python notebook code can be executed sequentially without errors.
- A PDF file generated from the notebook (export the notebook as an HTML file, open the HTML file in a browser, and print it as a PDF file).
Make sure that all the code cells can be executed sequentially without errors (Kernel -> Restart & Run All). Execution and formatting errors will be penalized.
The grade of this lab assignment is based on
- This submission (50 %).
- An individual in-class exam (50%).
Evaluation criteria:
- [6 points] Quality of the report (correctness, clarity, conciseness, completeness).
- [3 points] Quality of the code (correctness, adherence to a Python style guide -for instance, Google's-, comments, functional decomposition).
- [1 point] References.
Training a reinforcement learning agent at the gymnasium 🚀¶
Exercise 1: The gymnasium¶
TO DO: Complete the gymnasium tutorials¶
- Basic usage: https://gymnasium.farama.org/introduction/basic_usage/
- Training an RL-agent: https://gymnasium.farama.org/introduction/train_agent/
%load_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pyplot as plt
import gymnasium as gym
import imageio
import os
from tqdm.notebook import tqdm
import rl_utils
import time
from IPython.display import display, clear_output
from reinforcement_learning import (
sarsa_learning,
q_learning,
greedy_policy,
epsilon_greedy_policy,
)
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
# Install packages if needed
# %pip install pygame
# %pip install gymnasium
Step 0: Set up and understand Frozen Lake environment¶
Let's begin with a simple 4x4 map with is_slippery=False, meaning the agent always moves in the intended direction.
We add a parameter called render_mode that specifies how the environment should be visualised. Because we want to record a video of the environment at the end, we set render_mode to 'rgb_array'.
As the documentation explains, 'rgb_array' returns a single frame representing the current state of the environment: a np.ndarray with shape (H, W, 3) holding the RGB values of an H (height) by W (width) pixel image.
small_environment = gym.make(
'FrozenLake-v1',
map_name='4x4',
is_slippery=False,
render_mode='rgb_array',
)
Let's see what the environment looks like:¶
state, info = small_environment.reset() # observation state
action_names = {0: 'Left', 1: 'Down', 2: 'Right', 3: 'Up'}
print(state)
print(info, '(Probability that the action has led to the current state)')
n_states = small_environment.observation_space.n
print("There are ", n_states, " possible states")
n_actions = small_environment.action_space.n
print("There are ", n_actions, " possible actions")
print(action_names)
fig, ax = plt.subplots()
game_image = ax.imshow(small_environment.render())
0
{'prob': 1} (Probability that the action has led to the current state)
There are 16 possible states
There are 4 possible actions
{0: 'Left', 1: 'Down', 2: 'Right', 3: 'Up'}
# Generate a random observed state
print("Observation state (randomly selected)", small_environment.observation_space.sample())
# Generate a random action from the current state
print("Action (randomly selected):", small_environment.action_space.sample())
Observation state (randomly selected) 5
Action (randomly selected): 0
# Generate an episode
observation, info = small_environment.reset()  # Reset the environment to its initial state
fig, ax = plt.subplots()
game_image = ax.imshow(small_environment.render())
MAX_STEPS = 100
refresh_rate = 1 # in (1 / seconds)
episode_over = False
n_steps = 0
while not episode_over and n_steps < MAX_STEPS:
n_steps += 1
action = small_environment.action_space.sample()
state, reward, terminated, truncated, info = small_environment.step(action)
episode_over = terminated or truncated
ax.set_title(
'Step: {} State: {} Reward: {} Action:{}'.format(
n_steps,
state,
reward,
action_names[action],
)
)
display(fig)
time.sleep(1.0 / refresh_rate)
clear_output(wait=True) # Clear previous output
game_image.set_data(small_environment.render())
small_environment.close()
Step 1: Greedy and Epsilon greedy policies¶
Since Q-learning is an off-policy algorithm, we have two policies: we use a different policy for acting than for updating the value function.
- Epsilon-greedy policy (acting policy)
- Greedy-policy (updating policy)
The greedy policy is also the final policy we obtain once the Q-learning agent completes training: it selects actions directly from the Q-table.
Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off.
With probability 1 − ɛ we exploit: the agent selects the action with the highest state-action value.
With probability ɛ we explore: the agent tries a random action.
As the training continues, we progressively reduce the epsilon value since we will need less and less exploration and more exploitation.
TO DO: Implement these policies in reinforcement_learning.py¶
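As guidance, a minimal sketch of what these two policies could look like (the signatures here are illustrative; the actual functions belong in reinforcement_learning.py and may take different arguments):

```python
import numpy as np

def greedy_policy(Qtable, state):
    """Exploitation: pick the action with the highest Q-value in this state."""
    return int(np.argmax(Qtable[state]))

def epsilon_greedy_policy(Qtable, state, epsilon, rng):
    """Exploration with probability epsilon, exploitation otherwise."""
    n_actions = Qtable.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: uniformly random action
    return greedy_policy(Qtable, state)      # exploit: best known action
```

With epsilon = 0 this reduces to the pure greedy policy, and with epsilon = 1 it is uniformly random, which matches the exploration/exploitation trade-off described above.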
Define the hyperparameters for the learning process ⚙️¶
The exploration-related hyperparameters are some of the most important ones.
- We need to make sure that our agent explores enough of the state space to learn a good value approximation. To do that, we need a progressive decay of epsilon.
- If you decrease epsilon too fast (too high a decay_rate), you risk your agent getting stuck in a local optimum, since it didn't explore enough of the state space to solve the problem.
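One common schedule, and the one that the min_epsilon / max_epsilon / decay_rate parameters below suggest, is an exponential decay from max_epsilon towards min_epsilon (a sketch under that assumption):

```python
import numpy as np

def decayed_epsilon(episode, min_epsilon, max_epsilon, decay_rate):
    """Exponentially decay epsilon from max_epsilon towards min_epsilon:
    eps = min + (max - min) * exp(-decay_rate * episode)."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
```

At episode 0 this returns max_epsilon, and it approaches min_epsilon asymptotically, so exploration never drops to zero.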
# Training hyperparameters
n_training_episodes = 1000
max_steps = 100 # Maximum number of steps per episode
learning_rate = 0.7
gamma = 0.95 # Discount factor
# Exploration parameters
max_epsilon = 1.0 # Initial exploration probability
min_epsilon = 0.05 # Minimum exploration probability
decay_rate = 0.0005 # Exponential decay rate for the exploration probability
# Initialize Q-table
Qtable_small = np.zeros(
(
small_environment.observation_space.n,
small_environment.action_space.n
)
)
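The q_learning function itself lives in reinforcement_learning.py; the core of each step is the standard tabular Q-learning update (a sketch, assuming the usual formulation with learning rate α and discount γ):

```python
import numpy as np

def q_update(Qtable, state, action, reward, next_state, alpha, gamma):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * np.max(Qtable[next_state])  # greedy (off-policy) bootstrap
    Qtable[state, action] += alpha * (td_target - Qtable[state, action])
    return Qtable
```

Note that the bootstrap term uses the greedy max over the next state's actions, while actions during training are chosen epsilon-greedily; that mismatch is exactly what makes Q-learning off-policy.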
# Learn Q-table
Qtable_small = q_learning(
small_environment,
n_training_episodes,
max_steps,
learning_rate,
gamma,
min_epsilon,
max_epsilon,
decay_rate,
Qtable_small,
)
Let's see what our Q-Learning table looks like now 👀¶
Qtable_small
array([[0.73509189, 0.77378094, 0.77378094, 0.73509189],
[0.73509189, 0. , 0.81450625, 0.77378094],
[0.77378094, 0.857375 , 0.77378094, 0.81450625],
[0.81450625, 0. , 0.77378094, 0.77378094],
[0.77378094, 0.81450625, 0. , 0.73509189],
[0. , 0. , 0. , 0. ],
[0. , 0.9025 , 0. , 0.81450625],
[0. , 0. , 0. , 0. ],
[0.81450625, 0. , 0.857375 , 0.77378094],
[0.81450625, 0.9025 , 0.9025 , 0. ],
[0.857375 , 0.95 , 0. , 0.857375 ],
[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ],
[0. , 0.9025 , 0.95 , 0.857375 ],
[0.9025 , 0.95 , 1. , 0.9025 ],
[0. , 0. , 0. , 0. ]])
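One way to read the table: each row is a state, each column an action, and the greedy action in a state is the column with the largest value. A quick sketch of extracting that policy with np.argmax (using a small made-up Q-table for illustration, since rl_utils handles the real rollout):

```python
import numpy as np

action_names = {0: 'Left', 1: 'Down', 2: 'Right', 3: 'Up'}

# A toy 3-state Q-table standing in for Qtable_small (illustrative values only)
Qtable = np.array([
    [0.73, 0.77, 0.77, 0.73],
    [0.73, 0.00, 0.81, 0.77],
    [0.90, 0.95, 1.00, 0.90],
])

greedy_actions = np.argmax(Qtable, axis=1)          # best action per state (ties -> first)
print([action_names[a] for a in greedy_actions])    # ['Down', 'Right', 'Right']
```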
...and what our agent is doing!¶
frames = rl_utils.generate_greedy_episode(small_environment, Qtable_small)
rl_utils.show_episode(frames, interval=250)
A more challenging problem¶
We're ready now to find our way in more challenging environments 💥
large_environment = gym.make(
'FrozenLake-v1',
map_name='8x8',
is_slippery=False,
render_mode='rgb_array'
)
n_states = large_environment.observation_space.n
print("There are ", n_states, " possible states")
n_actions = large_environment.action_space.n
print("There are ", n_actions, " possible actions")
There are 64 possible states
There are 4 possible actions
# TO DO: Training hyperparameters
# Training hyperparameters
n_training_episodes = 10000
max_steps = 700 # Maximum number of steps per episode
learning_rate = 0.7
gamma = 0.99 # Discount factor
# Exploration parameters
max_epsilon = 1.0 # Initial exploration probability
min_epsilon = 0.1 # Minimum exploration probability
decay_rate = 0.00001 # Exponential decay rate for the exploration probability
# Initialize Q-table
Qtable_large = np.zeros(
(
large_environment.observation_space.n,
large_environment.action_space.n
)
)
# Learn Q-table
Qtable_large = q_learning(
large_environment,
n_training_episodes,
max_steps,
learning_rate,
gamma,
min_epsilon,
max_epsilon,
decay_rate,
Qtable_large,
)
print(Qtable_large)
(Output: the learned 64×4 Q-table for the 8x8 environment — one row of four action values per state, with all-zero rows at hole and goal states. Full printout omitted for readability.)
frames = rl_utils.generate_greedy_episode(large_environment, Qtable_large)
rl_utils.show_episode(frames, interval=250)
Slippery environment¶
Our environment is now slippery, meaning the agent sometimes slips and moves in a direction other than the one intended.
slippery_environment = gym.make(
'FrozenLake-v1',
map_name='8x8',
is_slippery=True,
render_mode='rgb_array',
)
To visualize the challenges in this environment, let's make our agent go right several times and see what happens:
go_right = []
action = 2 # go-right
n_steps = 20
state, info = slippery_environment.reset()
for i in range(n_steps):
go_right.append(slippery_environment.render()) # Capture current frame (RGB array)
slippery_environment.step(action)
rl_utils.show_episode(go_right, interval=250)
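According to the environment's documentation, a slippery step moves the agent in the intended direction or in either perpendicular direction, each with probability 1/3. A self-contained sketch of that slip model, without the environment (the slip helper is our own, for illustration):

```python
import numpy as np

def slip(intended_action, rng):
    """FrozenLake slippery dynamics: execute the intended direction or either
    perpendicular one, each with probability 1/3.
    Actions: 0=Left, 1=Down, 2=Right, 3=Up; perpendiculars are intended +/- 1 (mod 4)."""
    candidates = [(intended_action - 1) % 4, intended_action, (intended_action + 1) % 4]
    return int(rng.choice(candidates))

rng = np.random.default_rng(0)
outcomes = [slip(2, rng) for _ in range(30000)]  # always intend 'Right'
freqs = {a: outcomes.count(a) / len(outcomes) for a in set(outcomes)}
print(freqs)  # roughly 1/3 each for Down (1), Right (2) and Up (3)
```

This explains the "go right" demo above: only about one in three steps actually moves right, which is why the agent needs a much more robust policy in this environment.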
Let's see what happens when we train our agent in the same way as before.
We change the hyperparameters again so that the agent reaches the goal with is_slippery set to True.
# TO DO: Training hyperparameters
####################################################################################
n_training_episodes = 100000
max_steps = 800 # Maximum number of steps per episode
learning_rate = 0.1
gamma = 0.995 # Discount factor
# Exploration parameters
max_epsilon = 1.0 # Initial exploration probability
min_epsilon = 0.5 # Minimum exploration probability
decay_rate = 0.00001 # Exponential decay rate for the exploration probability
####################################################################################
# Initialize Q-table
Qtable_slippery = np.zeros(
(
slippery_environment.observation_space.n,
slippery_environment.action_space.n
)
)
# Learn Q-table
Qtable_slippery = q_learning(
slippery_environment,
n_training_episodes,
max_steps,
learning_rate,
gamma,
min_epsilon,
max_epsilon,
decay_rate,
Qtable_slippery,
)
print(Qtable_slippery)
(Output: the learned 64×4 Q-table for the slippery 8x8 environment — values are noticeably lower and noisier than in the deterministic case, with all-zero rows at hole states. Full printout omitted for readability.)
frames = rl_utils.generate_greedy_episode(slippery_environment, Qtable_slippery)
rl_utils.show_episode(frames, interval=250)